{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "### Dataset: Alice Dataset\n",
        "**Link: https://www.gutenberg.org/files/11/11-0.txt**"
      ],
      "metadata": {
        "id": "OF159UKJuKSU"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## N-gram language model:"
      ],
      "metadata": {
        "id": "G5eCmZSPuT7E"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach**: First context size and embedding dimension (which the domain in vector field) will be initialized."
      ],
      "metadata": {
        "id": "_CKH56bMuZq9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "contextSize = 3\n",
        "embeddingDimension = 20"
      ],
      "metadata": {
        "id": "jD4YusZ8uUiT"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** A sample text will be processed which will be considered as training data, using this text segment the vectorization scale can be understood."
      ],
      "metadata": {
        "id": "DonAAarpuecq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "test_sentence = \"\"\"India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, the most populous country\n",
        "in the world, and the most populous democracy.\"\"\""
      ],
      "metadata": {
        "id": "i80xsb6Dup-U"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:**  We will split the test in training and testing for better model evaluation and generalization purpose."
      ],
      "metadata": {
        "id": "P254BdVGuyf8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "ngram_model = [([test_sentence[m - n - 1] for n in range (contextSize)], test_sentence[m])\n",
        "                for m in range(contextSize, len(test_sentence))]"
      ],
      "metadata": {
        "id": "MhBjSNDTusvF"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:**  First four segment of the model will be printed in chronological order after the training process is complete."
      ],
      "metadata": {
        "id": "tVpmx_Evu4s9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(ngram_model[:1])\n",
        "print(ngram_model[:2])\n",
        "print(ngram_model[:3])\n",
        "print(ngram_model[:4])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6QwSdX_Tu2eL",
        "outputId": "d2c390fa-8b8d-4e5b-cf6d-13c91aab1cba"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[(['d', 'n', 'I'], 'i')]\n",
            "[(['d', 'n', 'I'], 'i'), (['i', 'd', 'n'], 'a')]\n",
            "[(['d', 'n', 'I'], 'i'), (['i', 'd', 'n'], 'a'), (['a', 'i', 'd'], ',')]\n",
            "[(['d', 'n', 'I'], 'i'), (['i', 'd', 'n'], 'a'), (['a', 'i', 'd'], ','), ([',', 'a', 'i'], ' ')]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "vocab = set(test_sentence)\n",
        "wordIxConversion = {word: m for m, word in enumerate(vocab)}"
      ],
      "metadata": {
        "id": "kbURav3zu8Gt"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** Necessary libraries will be imported to implement N gram language model."
      ],
      "metadata": {
        "id": "9JE7_xrFvAdD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "## Import libraries\n",
        "import torch\n",
        "import torch.nn as tnn\n",
        "import torch.optim as toptim\n",
        "import torch.nn.functional as tfunction"
      ],
      "metadata": {
        "id": "OlIIgv5Uu-ek"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** Using the nn module the N gram modeling class will be implemented.\n",
        "\n",
        "There will be segment of the total class, one initializing the context and embedded(vectorized) dimension and other indicating the forward propagation through the network."
      ],
      "metadata": {
        "id": "GqFJSqdwvGC9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "class nGramModelling(tnn.Module):\n",
        "    def __init__(self, vocab_dimension, embedding_dimension, context_dimension):\n",
        "        super(nGramModelling, self).__init__()\n",
        "        self.embedding = tnn.Embedding(vocab_dimension, embedding_dimension)\n",
        "        self.linear_first = tnn.Linear(context_dimension * embedding_dimension, 119)\n",
        "        self.linear_second = tnn.Linear(119, vocab_dimension)\n",
        "    def forward(self, training_set):\n",
        "        embedded_form = self.embedding(training_set).view((1, -1))\n",
        "        forward_out = tfunction.relu(self.linear_first(embedded_form))\n",
        "        forward_out = self.linear_second(forward_out)\n",
        "        log_propagation = tfunction.log_softmax(forward_out, dim=1)\n",
        "        return log_propagation"
      ],
      "metadata": {
        "id": "ShyrbMWxvD6s"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "lossFunction = tnn.NLLLoss()\n",
        "calculated_loss = []\n",
        "model_structure = nGramModelling(len(vocab), embeddingDimension, contextSize)\n",
        "optimization = toptim.SGD(model_structure.parameters(), lr=0.002)"
      ],
      "metadata": {
        "id": "j1ekHj8kvJwU"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** The error generated through forward propagation will be calculated and using gradient decent the more optimized approach will be implemented."
      ],
      "metadata": {
        "id": "zuFbqdLrvM3k"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for tempVariable in range(22):\n",
        "    totalLoss = 0\n",
        "    for context, target in ngram_model:\n",
        "        model_structure.zero_grad() ## Gradient initilization to zero\n",
        "        context_idxs = torch.tensor([wordIxConversion[w] for w in context]) ## Integer index mapping\n",
        "        log_propagation = model_structure(context_idxs) ## Forward pass\n",
        "        loss = lossFunction(log_propagation, torch.tensor([wordIxConversion[target]])) ## Calculate loss function\n",
        "        optimization.step() ## Update gradient\n",
        "        loss.backward() ## Backward pass\n",
        "        totalLoss += loss.item() ## Calculate total loss\n",
        "    calculated_loss.append(totalLoss)"
      ],
      "metadata": {
        "id": "jvsU8Kq5vLdr"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** The total loss will be prined."
      ],
      "metadata": {
        "id": "PzDOJrolvTZs"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(calculated_loss) ## Print the loss value"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jlRtc1-1vQoc",
        "outputId": "7fb0030e-583d-4586-9f6a-243ce6d4b2cf"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** To test the model and evaluate the generalization approach some vectorized form of letter of printed."
      ],
      "metadata": {
        "id": "HXTxGqm7vbtk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(model_structure.embedding.weight[wordIxConversion[\"i\"]]) ## Print the embedded form of 'i'"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "o0Pa4RhzvXHz",
        "outputId": "b76a019e-d082-438d-a7e5-c7a6c0288079"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "tensor([-0.1026,  0.1578,  0.9792,  0.5220,  0.3054,  1.8454, -0.0544, -0.2808,\n",
            "         2.1645,  1.6299,  0.9116,  1.1881,  0.5856,  0.3783,  0.2527,  1.3072,\n",
            "        -0.5567, -0.4364,  0.7162,  0.0098], grad_fn=<SelectBackward0>)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print(model_structure.embedding.weight[wordIxConversion[\"n\"]]) ## Print the embedded form of 'n'"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Socnzyl5vlAk",
        "outputId": "25b0f617-ea9d-43f2-9dbc-c1fdf38b92b1"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "tensor([ 1.0074,  0.9762,  0.8122,  0.8563,  0.1722, -0.7613, -1.7372,  0.3524,\n",
            "        -0.4371,  1.2262, -0.2154, -1.0835,  1.6382,  0.1090,  0.9307,  0.8007,\n",
            "        -0.2892,  0.9695, -1.2736,  0.5683], grad_fn=<SelectBackward0>)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print(model_structure.embedding.weight[wordIxConversion[\"d\"]]) ## Print the embedded form of 'd'"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8eKBIAQ0vpdV",
        "outputId": "d74d570d-90e9-4ec2-e588-31524339246d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "tensor([-0.5240, -0.0740, -0.9617, -1.7470, -0.7892, -0.1827, -0.4542,  0.5034,\n",
            "         0.7301,  0.7181, -1.4437, -0.6801,  0.3338, -0.2508, -0.3352, -0.4485,\n",
            "         0.6673,  1.2342, -2.1650,  0.1493], grad_fn=<SelectBackward0>)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## CBOW model:"
      ],
      "metadata": {
        "id": "XLS8LzGKvv80"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** In this modelling method, multiple word input vectors will be trained in a single projection and it will be mapped to a specified vectorized output.\n",
        "\n",
        "First the context dimension and main text(training) will be initialized."
      ],
      "metadata": {
        "id": "kyjF_RAEvyna"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "context_dimension = 2  ## 2 words to left and 2 to right\n",
        "mainTextSegment = \"\"\" West Bengal is a state in eastern India, between the Himalayas and the Bay of Bengal. Its capital, Kolkata (formerly Calcutta), retains\n",
        "architectural and cultural remnants of its past as an East India Company trading post and capital of the British Raj. The city's colonial landmarks include\n",
        "the government buildings around B.B.D. Bagh Square, and the iconic Victoria Memorial, dedicated to Britain's queen. \"\"\""
      ],
      "metadata": {
        "id": "oIzDYeFYvtFE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** The text will be considered as array format and the array will be deduplicated. The length which will be used for training purpose will printed."
      ],
      "metadata": {
        "id": "4PxUu8PWwBn0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "## deduplication of array\n",
        "vocab = set(mainTextSegment)\n",
        "vocab_length = len(vocab)\n",
        "print(vocab_length)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HvUXaKYIv_Qq",
        "outputId": "76c76adb-4abf-44e7-ddbf-1d40cdd7a2dd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "44\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "wordIxConversion = {word: m for m, word in enumerate(vocab)}\n",
        "extracted_word = []"
      ],
      "metadata": {
        "id": "FRdC03GpwF3b"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:**  Some specific text area will be extracted from the total text segment for training purpose and other will be for testing purpose to achieve better generalized approach."
      ],
      "metadata": {
        "id": "IGjhmqjnwI40"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for m in range(contextSize, len(mainTextSegment) - contextSize):\n",
        "    context = ([mainTextSegment[m + n + 1] for n in range(context_dimension)] + [mainTextSegment[m - m - 1] for n in range(context_dimension)])\n",
        "    targetSegment = mainTextSegment[m]\n",
        "    extracted_word.append((context, targetSegment))"
      ],
      "metadata": {
        "id": "6fRTJqYvwHRL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** The first five extracted segment of the training text will be printed."
      ],
      "metadata": {
        "id": "vE4ac5AhwOG8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(extracted_word[:1])\n",
        "print(extracted_word[:2])\n",
        "print(extracted_word[:3])\n",
        "print(extracted_word[:4])\n",
        "print(extracted_word[:5])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ysyaNeR-wMwD",
        "outputId": "34b23ae8-16c8-443c-b29f-09f65eca5ccd"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[(['t', ' ', ' ', ' '], 's')]\n",
            "[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't')]\n",
            "[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't'), (['B', 'e', ' ', ' '], ' ')]\n",
            "[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't'), (['B', 'e', ' ', ' '], ' '), (['e', 'n', ' ', ' '], 'B')]\n",
            "[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't'), (['B', 'e', ' ', ' '], ' '), (['e', 'n', ' ', ' '], 'B'), (['n', 'g', ' ', ' '], 'e')]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** The CBOW class will be implemented and the context vectorizer function will be initialized to vectorize the domain."
      ],
      "metadata": {
        "id": "TyQYK9nPwSnc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "class CBOWModelling(tnn.Module):\n",
        "    def __init__(self):\n",
        "        pass\n",
        "    def forward(self, inputs):\n",
        "        pass"
      ],
      "metadata": {
        "id": "hcpg4lMiwRBj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "## Model creation and training\n",
        "def contextVectorization(context, word_to_ix):\n",
        "    vectorized_id = [wordIxConversion[w] for w in context]\n",
        "    return torch.tensor(vectorized_id, dtype=torch.long)"
      ],
      "metadata": {
        "id": "Xo2ASB6pwVzj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:** As the text segment was taken as an array format, so some array index will printed to analyze the vectorized form for that index."
      ],
      "metadata": {
        "id": "TEG9CzujwYwn"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "contextVectorization(extracted_word[0][0], wordIxConversion)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OIB0J5t3wXbE",
        "outputId": "40e1d225-69b7-4167-a499-3300fb201f3b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([ 7, 12, 12, 12])"
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "contextVectorization(extracted_word[0][1], wordIxConversion)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "oTe3oq2Hwb6E",
        "outputId": "46b09c45-e05f-4e07-b0c3-276b58b4631d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([28])"
            ]
          },
          "metadata": {},
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "contextVectorization(extracted_word[1][0], wordIxConversion)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3Z5K34Vywd_b",
        "outputId": "f2a3bf27-78c6-4c7d-e43e-41227a6e5bd8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([12, 25, 12, 12])"
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "contextVectorization(extracted_word[1][1], wordIxConversion)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "BxdcfGfYwffD",
        "outputId": "5ccc8041-69ee-4c37-aee5-b21ee2bd53e0"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([7])"
            ]
          },
          "metadata": {},
          "execution_count": 24
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Semantic similarity measurement:"
      ],
      "metadata": {
        "id": "vHZyiV1zwivV"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Approach:**\n",
        "For word processing and measurement purpose in vectorized format nltk and gensim will be used.\n",
        "\n",
        "First the libraries will be imported and dataset text file will be processed, then word2vec will be implemented in both the modelling format n gram and CBOW."
      ],
      "metadata": {
        "id": "Q9NvS_a1wl3t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from nltk.tokenize import sent_tokenize, word_tokenize\n",
        "import warnings as warn\n",
        "import nltk\n",
        "nltk.download('punkt')\n",
        "\n",
        "import gensim\n",
        "from gensim.models import Word2Vec\n",
        "\n",
        "warn.filterwarnings(action = 'ignore')\n",
        "\n",
        "\n",
        "## Reads text file dataset\n",
        "sample_text = open('/content/drive/MyDrive/Dataset/alice.txt')\n",
        "extracted_text = sample_text.read()\n",
        "\n",
        "# Reapleces all the escape characters with space segment\n",
        "updated_text = extracted_text.replace(\"\\n\", \" \")\n",
        "\n",
        "data = []\n",
        "\n",
        "# Travarsing through all senteces of dataset\n",
        "for m in sent_tokenize(updated_text):\n",
        "    temp = []\n",
        "\n",
        "    # word tokenization\n",
        "    for n in word_tokenize(m):\n",
        "        temp.append(n.lower())\n",
        "\n",
        "    data.append(temp)\n",
        "\n",
        "# Implementation of N Gram model\n",
        "model_first = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100, window = 5, sg = 1)\n",
        "\n",
        "# Print results\n",
        "print(\"Cosine similarity between 'alice' \" + \"and 'wonderland' - N Gram : \", model_first.wv.similarity('alice', 'wonderland'))\n",
        "\n",
        "print(\"Cosine similarity between 'alice' \" + \"and 'machines' - N Gram : \", model_first.wv.similarity('alice', 'machines'))\n",
        "\n",
        "# Implementation of  CBOW model\n",
        "model_second = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100, window = 5)\n",
        "\n",
        "# Print results\n",
        "print(\"Cosine similarity between 'alice' \" + \"and 'wonderland' - CBOW : \", model_second.wv.similarity('alice', 'wonderland'))\n",
        "\n",
        "print(\"Cosine similarity between 'alice' \" + \"and 'machines' - CBOW : \", model_second.wv.similarity('alice', 'machines'))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RMtO5tHwwg-z",
        "outputId": "0312a8fe-161b-449b-d541-b14dd5b056e6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Cosine similarity between 'alice' and 'wonderland' - N Gram :  0.73083436\n",
            "Cosine similarity between 'alice' and 'machines' - N Gram :  0.8429953\n",
            "Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.97775567\n",
            "Cosine similarity between 'alice' and 'machines' - CBOW :  0.933941\n"
          ]
        }
      ]
    }
  ]
}